In [ ]:
# convention recommended in documentation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# enable inline plotting in the notebook
%matplotlib inline
Let's start by reading in a dataset. This dataset describes different species of the iris flower.
We'll use it for another exercise later on.
In [ ]:
df = pd.read_csv("../data/iris.data")
df = df.sample(frac=0.2) # only use 20% of the data so the results aren't so long
type(df)
DataFrame is the basic building block of Pandas. It represents two-dimensional data with labeled rows and columns.
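To make "labeled rows and columns" concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The values and the row labels `a`, `b`, `c` are invented for illustration and are not taken from `iris.data`.

```python
import pandas as pd

# Toy measurements, invented for illustration (not read from iris.data).
toy = pd.DataFrame(
    {
        "sepal.length": [5.1, 7.0, 6.3],
        "class": ["Setosa", "Versicolor", "Virginica"],
    },
    index=["a", "b", "c"],  # explicit row labels
)

print(toy.shape)                     # (number of rows, number of columns)
print(list(toy.columns))             # column labels
print(list(toy.index))               # row labels
print(toy.loc["b", "sepal.length"])  # look up one cell by row and column label
```

Without the `index` argument, pandas labels the rows 0, 1, 2, and so on.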
In [ ]:
# Columns can have different types.
# you can check the data types of the values
df.dtypes
In [ ]:
# you can access the dataframe with a single column name
df["petal.width"]
# this leaves the original unmodified
In [ ]:
# then the returned type is a Series, the second major concept in Pandas
type(df["petal.width"])
In [ ]:
# alternatively you can index a dataframe with a list of column names, which returns a DataFrame
df[["sepal.length", "petal.width", "class"]]
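The difference between indexing with a single name and with a list of names can be checked directly. This is a small sketch on a made-up frame (the values are illustrative only); the same behaviour should hold for `df` above.

```python
import pandas as pd

# Toy frame with invented values, standing in for the iris data.
toy = pd.DataFrame(
    {
        "petal.width": [0.2, 1.4, 2.5],
        "class": ["Setosa", "Versicolor", "Virginica"],
    }
)

one_column = toy["petal.width"]   # single name: a one-dimensional Series
as_frame = toy[["petal.width"]]   # list of names: still a two-dimensional DataFrame

print(type(one_column).__name__, one_column.shape)
print(type(as_frame).__name__, as_frame.shape)
```

Even a one-element list keeps the result two-dimensional, which matters when later code expects a DataFrame.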
In [ ]:
# the comparison operator works element-wise and returns a boolean Series
matching = df["sepal.width"] > df["petal.length"]
matching
In [ ]:
# the boolean Series can be used as a mask to select the matching rows
df[matching]
# or, more idiomatically, in a single expression
df[df["sepal.width"] > df["petal.length"]]
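The masking step above can be traced on a few rows. The sketch below uses invented values, and the combined condition at the end is an extra illustration, not part of the original exercise.

```python
import pandas as pd

# Toy frame with invented values, standing in for the iris data.
toy = pd.DataFrame(
    {
        "sepal.width": [3.5, 3.2, 3.3],
        "petal.length": [1.4, 4.7, 6.0],
    }
)

# Element-wise comparison: one True/False value per row.
mask = toy["sepal.width"] > toy["petal.length"]
print(mask.tolist())

# Indexing with the mask keeps only the rows where it is True.
selected = toy[mask]
print(len(selected))

# Conditions are combined with & (and) and | (or); each one needs its own parentheses.
both = toy[(toy["sepal.width"] > 3.0) & (toy["petal.length"] < 5.0)]
print(len(both))
```

Python's `and` and `or` do not work element-wise on a Series, which is why `&` and `|` are used here.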
In [ ]:
# one can get aggregates of single dimensions
df["sepal.width"].var() # try min, max, mean, median, sum, var
In [ ]:
# or of the whole DataFrame, column by column
df.sum() # in recent pandas versions, mean, median and var need numeric_only=True if a column holds strings
In [ ]:
# it's also possible to plot simple graphs using a simpleish syntax
df["sepal.width"].plot.box()
In [ ]:
df["sepal.width"].plot.hist()
In [ ]:
df.boxplot(column="sepal.width", by="class")
In [ ]:
df.plot.scatter(x="sepal.length", y="sepal.width")
In [ ]:
df.groupby("class").mean()
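What `groupby` followed by `mean` computes can be verified by hand on a small frame. This is a sketch with invented values; `agg` is shown as one way to get several statistics per group at once.

```python
import pandas as pd

# Toy frame with invented values, standing in for the iris data.
toy = pd.DataFrame(
    {
        "class": ["Setosa", "Setosa", "Virginica", "Virginica"],
        "sepal.length": [5.0, 6.0, 6.0, 7.0],
    }
)

# Split the rows by class, then average one column within each group.
means = toy.groupby("class")["sepal.length"].mean()
print(means)

# Several aggregates per group in one call.
summary = toy.groupby("class")["sepal.length"].agg(["mean", "min", "max"])
print(summary)
```

The group keys become the index of the result, so individual groups can be looked up with `.loc`, for example `summary.loc["Virginica", "max"]`.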
In [ ]:
# creating a plot with one curve per group requires a loop
fig, ax = plt.subplots(figsize=(8, 6))
for label, df_ in df.groupby("class"):
    # draw each group's density estimate on the same axes
    df_["sepal.length"].plot(kind="kde", ax=ax, label=label)
plt.legend()
In [ ]:
Plot box plots of the data. Is there a value that is off?
In [ ]:
How do you remove it?